Normal and Compound Poisson Approximations for Pattern Occurrences in NGS Reads

Authors

  • Zhiyuan Zhai
  • Gesine Reinert
  • Kai Song
  • Michael S. Waterman
  • Yihui Luan
  • Fengzhu Sun
Abstract

Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
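
To make the comparison of the two approximations concrete, the following is a minimal sketch and not the authors' algorithm or software. It assumes, purely for illustration, that pattern occurrences arrive in clumps whose number is Poisson with rate `lam` and whose sizes are geometric with parameter `a`; these parameters and all function names are hypothetical. The compound Poisson probabilities are computed with the standard Panjer recursion, and the upper tail is compared with a normal tail that has the same mean and variance.

```python
import math


def geometric_clump_pmf(a, j_max):
    """Assumed clump-size pmf: f[j] = (1 - a) * a**(j - 1), j = 1..j_max; f[0] = 0."""
    return [0.0] + [(1.0 - a) * a ** (j - 1) for j in range(1, j_max + 1)]


def compound_poisson_pmf(lam, clump_pmf, n_max):
    """P(N = n) for n = 0..n_max, where N is the sum of K i.i.d. clump sizes
    and K ~ Poisson(lam). Uses the Panjer recursion
    p[n] = (lam / n) * sum_j j * f[j] * p[n - j], which requires f[0] = 0."""
    p = [0.0] * (n_max + 1)
    p[0] = math.exp(-lam)
    for n in range(1, n_max + 1):
        j_top = min(n, len(clump_pmf) - 1)
        s = sum(j * clump_pmf[j] * p[n - j] for j in range(1, j_top + 1))
        p[n] = lam / n * s
    return p


def upper_tail(pmf, k):
    """P(N >= k) from a pmf list indexed from 0."""
    return max(0.0, 1.0 - sum(pmf[:k]))


def normal_upper_tail(k, mean, var):
    """Normal approximation to P(N >= k) with a continuity correction."""
    z = (k - 0.5 - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    lam, a, k = 2.0, 0.3, 10  # illustrative values only
    pmf = compound_poisson_pmf(lam, geometric_clump_pmf(a, 200), 200)
    mean = lam / (1.0 - a)                    # lam * E[clump size]
    var = lam * (1.0 + a) / (1.0 - a) ** 2    # lam * E[clump size squared]
    print("compound Poisson tail:", upper_tail(pmf, k))
    print("normal tail:          ", normal_upper_tail(k, mean, var))
```

For rare, self-overlapping patterns the count distribution is skewed, so the two tails in a sketch like this can differ noticeably; in the paper itself the parameters come from the background sequence model and the read sampling model rather than being chosen by hand.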

Similar resources

Approximation of word counts in Markov chains

In this talk, we give an overview of the different approximation results that exist for the statistical distribution of word counts in a Markov chain. Results concerning the number of overlapping occurrences, the number of non-overlapping occurrences (renewals) and the declumped count will be presented. Counts of single words but also multiple words and word families are considered. We will see ...

Compound Poisson approximation: a user’s guide

Compound Poisson approximation is a useful tool in a variety of applications, including insurance mathematics, reliability theory, and molecular sequence analysis. In this paper, we review the ways in which Stein's method can currently be used to derive bounds on the error in such approximations. The theoretical basis for the construction of error bounds is systematically discussed, and a numbe...

Probabilistic and Statistical Properties of Words: An Overview

In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a...

Numerical solution and simulation of random differential equations with Wiener and compound Poisson Processes

Ordinary differential equations (ODEs) with stochastic processes in their vector field have many applications in science and engineering. The main purpose of this article is to investigate numerical methods for ODEs with Wiener and compound Poisson processes in more than one dimension. Ordinary differential equations with Ito diffusion which is a solution of an Ito stochastic differentia...

Empirical Bayes Estimators with Uncertainty Measures for NEF-QVF Populations

The paper proposes empirical Bayes (EB) estimators for simultaneous estimation of means in the natural exponential family (NEF) with quadratic variance functions (QVF) models. Morris (1982, 1983a) characterized the NEF-QVF distributions which include among others the binomial, Poisson and normal distributions. In addition to the EB estimators, we provide approximations to the MSE’s of t...

Journal title:
  • Journal of computational biology : a journal of computational molecular cell biology

Volume 19, Issue 6

Pages -

Publication date: 2012